This started out as an analysis script, but there was still some ambiguity left in how the population descriptors were coded, so we’re going to revisit it! Yay :(
knitr::opts_knit$set(root.dir="~/OneDrive - St Vincent's Institute/Documents/RNA\ Diversity/", echo=T)
# But we also have to do this manually because R studio is stupid:
setwd(knitr::opts_knit$get("root.dir"))
library(ggplot2)
library(data.table)
library(plyr)
library(reshape)
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:plyr':
##
## rename, round_any
## The following object is masked from 'package:data.table':
##
## melt
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ dplyr::between() masks data.table::between()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::desc() masks plyr::desc()
## ✖ tidyr::expand() masks reshape::expand()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ lubridate::hour() masks data.table::hour()
## ✖ dplyr::id() masks plyr::id()
## ✖ lubridate::isoweek() masks data.table::isoweek()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ lubridate::mday() masks data.table::mday()
## ✖ lubridate::minute() masks data.table::minute()
## ✖ lubridate::month() masks data.table::month()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ lubridate::quarter() masks data.table::quarter()
## ✖ dplyr::rename() masks reshape::rename(), plyr::rename()
## ✖ lubridate::second() masks data.table::second()
## ✖ lubridate::stamp() masks reshape::stamp()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ✖ purrr::transpose() masks data.table::transpose()
## ✖ lubridate::wday() masks data.table::wday()
## ✖ lubridate::week() masks data.table::week()
## ✖ lubridate::yday() masks data.table::yday()
## ✖ lubridate::year() masks data.table::year()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(viridis)
## Loading required package: viridisLite
options(device = "quartz")
allSRAFinal <- readRDS("allSRAFinalTissuesCentersDisease.rds")
Let’s look at the data using the previous population descriptors and country terms.
allSRAFinal <- allSRAFinal %>%
mutate(finalRace = gsub("or ", "or\n", finalRace)) %>%
mutate(finalGeography = gsub("and ", "and\n", finalGeography)) %>%
as.data.frame()
geographyClean <- allSRAFinal %>% drop_na(finalGeography)
raceClean <- allSRAFinal %>% drop_na(finalRace)
ggplot(geographyClean, aes(x = finalCountry, fill = finalGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
theme_bw() +
ggtitle("Geography across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position="bottom")
ggplot(raceClean, aes(x = finalCountry, fill=finalRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
theme_bw() +
ggtitle("Race across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position="bottom")
More clearly, who is sequencing each pop descriptor?
ggplot(geographyClean, aes(fill = finalCountry, x = finalGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
theme_bw() +
ggtitle("Country of sequencing across descriptors") +
xlab("Geographic descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position="bottom")
ggplot(raceClean, aes(fill = finalCountry, x = finalRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), alpha=0.8) +
theme_bw() +
ggtitle("Country of sequencing across descriptors") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position="bottom")
So… after looking at this, it’s clear there’s still some noise in the descriptor classifications, primarily with who we are labeling as ‘white’ vs European… It’s worth looking at some of these studies more closely. For instance, we know Singapore and Brazil both use the term ‘white’ but it’s very unlikely to be referring to US census categories, and instead reflects their own legal recognition.
First, it’s worth remembering how the labels are made: finalGeography is made by looking at data in columns ETHNICITY, DONOR_ETHNICITY, ancestry and Population, whereas finalRace is made from RACE, primary_race, reported_race, race.ethnicity and donor_race. There’s a lot of downstream clean-up involved, which is perhaps a problem.
There are a couple of options here:
Let’s go in order:
This method is worth exploring, to see how things look if we are strict (worth reporting in a supplement). All of the code used here and below is borrowed from 02_clean_ancestry_terms.Rmd but without most of the intermediate qc because this is meant to be brutal and automatic.
This is coded as IGR.term.strictest
popDescriptors <- read.csv("20240702_descriptor_terms_new_schema.csv")
allSRAFinal$mergedGeography <- coalesce(allSRAFinal$ancestry, allSRAFinal$Population) %>%
coalesce(., allSRAFinal$ETHNICITY)
allSRAFinal$mergedRace <- coalesce(allSRAFinal$RACE, allSRAFinal$race.ethnicity) %>%
coalesce(., allSRAFinal$DONOR_ETHNICITY) %>%
coalesce(., allSRAFinal$reported_race) %>%
coalesce(., allSRAFinal$primary_race)
popGeoNew <- popDescriptors[popDescriptors$IGR.coding.strictest %in% "geography",]
popRaceNew <- popDescriptors[popDescriptors$IGR.coding.strictest %in% "race",]
# And now we replace those labels with the ones in the schema:
allSRAFinal$strictestGeography <- popGeoNew[match(allSRAFinal$mergedGeography, popGeoNew$Term),]$IGR.term.strictest
allSRAFinal$strictestRace <- popRaceNew[match(allSRAFinal$mergedRace, popRaceNew$Term),]$IGR.term.strictest
# At this point I originally did a lot of swapping of terms back and forth, but I believe in the strictest case we shouldn't do too much of that, just make sure there's nothing with a double assignment:
# table(allSRAFinal$strictestGeography)
# table(allSRAFinal$strictestRace)
#
# table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$mergedGeography))
# table(!is.na(allSRAFinal$strictestRace), !is.na(allSRAFinal$mergedRace))
# We can quickly check what becomes what, with v2 being the original and v1 being the new one:
geographyTerms <- as.data.frame(table(allSRAFinal$strictestGeography, allSRAFinal$mergedGeography)) %>% .[.$Freq > 0,]
raceTerms <- as.data.frame(table(allSRAFinal$strictestRace, allSRAFinal$mergedRace)) %>% .[.$Freq > 0,]
# geographyTerms
# raceTerms
# There's a few studies with joint assignments...
table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$strictestRace))
##
## FALSE TRUE
## FALSE 535 13076
## TRUE 9397 1020
jointTerms <- as.data.frame(table(allSRAFinal$strictestRace, allSRAFinal$strictestGeography)) %>% .[.$Freq > 0,]
jointTerms
## Var1 Var2 Freq
## 3 Black or African American Americas 3
## 6 Other Americas 2
## 7 White Americas 10
## 9 Asian Asia 39
## 16 Asian East Asia 246
## 24 Black or African American Europe 10
## 28 White Europe 607
## 44 Asian Oceania 6
## 73 Black or African American Subsaharan Africa 97
mismatches <- allSRAFinal[!is.na(allSRAFinal$strictestGeography) & !is.na(allSRAFinal$strictestRace),]
table(mismatches$SRA.Study) # Quite a lot of studies... where are they from?
##
## SRP068551 SRP103099 SRP107326 SRP122876 SRP150833 SRP159625 SRP179998 SRP241159 SRP257773 SRP267193 SRP291048 SRP292867 SRP301528 SRP342015 SRP347253 SRP362734
## 18 12 208 16 24 10 15 251 6 5 4 24 60 24 304 39
mismatches %>% count(finalCountry, SRA.Study, strictestGeography, strictestRace)
## finalCountry SRA.Study strictestGeography strictestRace n
## 1 Brazil SRP179998 Americas Black or African American 3
## 2 Brazil SRP179998 Americas Other 2
## 3 Brazil SRP179998 Americas White 10
## 4 China SRP068551 East Asia Asian 18
## 5 China SRP362734 Asia Asian 39
## 6 Denmark SRP122876 Europe White 16
## 7 Germany SRP342015 Europe White 24
## 8 Korea SRP292867 East Asia Asian 20
## 9 Korea SRP292867 Europe White 4
## 10 Russia SRP150833 Europe White 24
## 11 Spain SRP301528 Europe White 60
## 12 Taiwan SRP107326 East Asia Asian 208
## 13 USA SRP103099 Europe White 11
## 14 USA SRP103099 Subsaharan Africa Black or African American 1
## 15 USA SRP159625 Europe Black or African American 10
## 16 USA SRP241159 Europe White 164
## 17 USA SRP241159 Subsaharan Africa Black or African American 87
## 18 USA SRP257773 Oceania Asian 6
## 19 USA SRP267193 Subsaharan Africa Black or African American 5
## 20 USA SRP291048 Subsaharan Africa Black or African American 4
## 21 USA SRP347253 Europe White 304
# So in the strictest case, I am making the call that if the study contained both valid labels and is outside the USA, it goes to geography, if it's inside the USA it goes to race.
geoSRA <- unique(mismatches[!mismatches$finalCountry %in% "USA",]$SRA.Study)
raceSRA <- unique(mismatches[mismatches$finalCountry %in% "USA",]$SRA.Study)
allSRAFinal[allSRAFinal$SRA.Study %in% geoSRA,]$strictestRace <- NA
allSRAFinal[allSRAFinal$SRA.Study %in% raceSRA,]$strictestGeography <- NA
table(!is.na(allSRAFinal$strictestGeography), !is.na(allSRAFinal$strictestRace)) # Done... ish
##
## FALSE TRUE
## FALSE 535 13668
## TRUE 9825 0
# Once we decide what to do with the hispanics, it should resolve this problem:
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(finalRace)
## finalRace n
## 1 American Indian or\nAlaskan Native 16
## 2 <NA> 519
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(finalGeography)
## finalGeography n
## 1 <NA> 535
allSRAFinal[is.na(allSRAFinal$strictestGeography) & is.na(allSRAFinal$strictestRace),] %>% count(hispanic)
## hispanic n
## 1 hispanic 387
## 2 non.hispanic 132
## 3 <NA> 16
Only if it really matches the US Census term in the geography columns, or if it really doesn’t match the census terms in the race columns, are things switched over.
This is coded as IGR.term.strict
popGeoNew <- popDescriptors[popDescriptors$IGR.coding.strict %in% "geography",]
popRaceNew <- popDescriptors[popDescriptors$IGR.coding.strict %in% "race",]
allSRAFinal$strictGeography <- popGeoNew[match(allSRAFinal$mergedGeography, popGeoNew$Term),]$IGR.term.strict
allSRAFinal$strictRace <- popRaceNew[match(allSRAFinal$mergedRace, popRaceNew$Term),]$IGR.term.strict
table(allSRAFinal$strictGeography)
##
## Americas Asia East Asia Europe Multiple North Africa and Western Asia
## 164 304 1121 5936 206 18
## Other South Asia Southeast Asia Subsaharan Africa
## 62 718 49 600
table(allSRAFinal$strictRace)
##
## American Indian and Alaska Native Asian Black or African American Multiple
## 65 943 1816 237
## Native Hawaiian and other Pacific Islander Other White
## 8 217 10135
# table(!is.na(allSRAFinal$strictGeography), !is.na(allSRAFinal$mergedGeography))
# table(!is.na(allSRAFinal$strictRace), !is.na(allSRAFinal$mergedRace))
Now we have two problems - terms in the wrong column, and also double labels. Let’s start with things that had race/geography info in the relevant columns before the recode, but weren’t parsed correctly, ie we decided it was racial when it was coded as ethnicity or viceversa.
# We start with race
allSRAFinal$raceFails <- (!is.na(allSRAFinal$mergedRace) & is.na(allSRAFinal$strictRace))
allSRAFinal$geographyFails <- (!is.na(allSRAFinal$mergedGeography) & is.na(allSRAFinal$strictGeography))
raceOK <- allSRAFinal[allSRAFinal$raceFails == FALSE,] #These got classified correctly, we think.
raceOK$rescueGeography <- raceOK$strictGeography
raceFails <- allSRAFinal[allSRAFinal$raceFails == TRUE,]
raceFails$rescueGeography <- popGeoNew[match(raceFails$mergedRace, popGeoNew$Term),]$IGR.term.strict
allSRATemp <- rbind(raceOK, raceFails)
# Now we do the same to the mismatches in the geography term:
geographyOK <- allSRATemp[allSRATemp$geographyFails == FALSE,]
geographyOK$rescueRace <- geographyOK$strictRace
geographyFails <- allSRATemp[allSRATemp$geographyFails == TRUE,]
geographyFails$rescueRace <- popRaceNew[match(geographyFails$mergedGeography, popRaceNew$Term),]$IGR.term.strict
allSRATemp <- rbind(geographyOK, geographyFails)
# allSRATemp %>% count(strictRace, rescueRace)
# allSRATemp %>% count(strictGeography, rescueGeography)
allSRATemp$strictRace <- coalesce(allSRATemp$strictRace, allSRATemp$rescueRace)
# table(allSRATemp$finalRace)
# table(allSRATemp$strictRace)
#
# table(allSRATemp$finalGeography)
# table(allSRATemp$strictGeography)
allSRAFinal <- allSRATemp
(rm(list=c("allSRATemp", "raceOK", "raceFails", "geographyOK", "geographyFails")))
## NULL
#And now we deal with things with both kinds of info:
table(!is.na(allSRAFinal$strictGeography), !is.na(allSRAFinal$strictRace))
##
## FALSE TRUE
## FALSE 871 13979
## TRUE 8591 587
jointTerms <- as.data.frame(table(allSRAFinal$strictRace, allSRAFinal$strictGeography)) %>% .[.$Freq > 0,]
jointTerms
## Var1 Var2 Freq
## 3 Black or African American Americas 3
## 7 White Americas 10
## 9 Asian Asia 39
## 16 Asian East Asia 246
## 24 Black or African American Europe 10
## 28 White Europe 279
mismatches <- allSRAFinal[!is.na(allSRAFinal$strictGeography) & !is.na(allSRAFinal$strictRace),]
table(mismatches$SRA.Study) # Quite a lot of studies... where are they from?
##
## SRP068551 SRP103099 SRP107326 SRP122876 SRP150833 SRP159625 SRP179998 SRP241159 SRP292867 SRP301528 SRP362734
## 18 11 208 16 24 10 13 164 24 60 39
mismatches %>% count(finalCountry, SRA.Study, strictGeography, strictRace)
## finalCountry SRA.Study strictGeography strictRace n
## 1 Brazil SRP179998 Americas Black or African American 3
## 2 Brazil SRP179998 Americas White 10
## 3 China SRP068551 East Asia Asian 18
## 4 China SRP362734 Asia Asian 39
## 5 Denmark SRP122876 Europe White 16
## 6 Korea SRP292867 East Asia Asian 20
## 7 Korea SRP292867 Europe White 4
## 8 Russia SRP150833 Europe White 24
## 9 Spain SRP301528 Europe White 60
## 10 Taiwan SRP107326 East Asia Asian 208
## 11 USA SRP103099 Europe White 11
## 12 USA SRP159625 Europe Black or African American 10
## 13 USA SRP241159 Europe White 164
# In this case we should be more nuanced, rather than simply shoving things one way or another on the basis of depositing country (although I am tempted):
# by(mismatches, mismatches$SRA.Study, function(x) head(x, n=10))
Some familiar names in here that I’ve had to decide on before, and new entries:
allSRAFinal[allSRAFinal$SRA.Study %in% c("SRP068551", "SRP107326", "SRP122876", "SRP150833", "SRP292867", "SRP179998", "SRP362734", "SRP301528"),]$strictRace <- NA
allSRAFinal[allSRAFinal$SRA.Study %in% c("SRP068551", "SRP159625", "SRP241159"),]$strictGeography <- NA
Not repeating this, since it’s what we did to begin with. Captured in the columns finalRace and finalGeography
Now we compare all three approaches.
We begin with the easy one:
Reminder that final is the most artisanal, strict is a bit more conservative and strictest does not move terms across columns even when these are patently in the wrong place.
# Let's clean up the terms again
allSRAFinal <- allSRAFinal %>%
mutate(strictRace = gsub("or ", "or\n", strictRace)) %>%
mutate(strictestRace = gsub("or ", "or\n", strictestRace)) %>%
mutate(strictRace = gsub("and ", "and\n", strictRace)) %>%
mutate(strictestRace = gsub("and ", "and\n", strictestRace)) %>%
mutate(strictGeography = gsub("and ", "and\n", strictGeography)) %>%
mutate(strictestGeography = gsub("and ", "and\n", strictestGeography)) %>%
as.data.frame()
# And let's remove the intermediate columns because they're annoying:
allSRAFinal <- allSRAFinal[,c(1:41,44:47)]
# Some big changes in White/European and Black/Subsaharan Africa if we are very strict with terms...
allSRAFinal %>% count(finalGeography, strictGeography, strictestGeography)
## finalGeography strictGeography strictestGeography n
## 1 Americas Americas Americas 164
## 2 Asia Asia Asia 304
## 3 East Asia East Asia East Asia 1103
## 4 East Asia <NA> East Asia 18
## 5 East Asia <NA> <NA> 212
## 6 Europe Europe Europe 2952
## 7 Europe Europe <NA> 304
## 8 Europe <NA> <NA> 25
## 9 Multiple Multiple Multiple 206
## 10 North Africa and\nWestern Asia North Africa and\nWestern Asia North Africa and\nWestern Asia 18
## 11 North Africa and\nWestern Asia South Asia South Asia 8
## 12 North Africa and\nWestern Asia <NA> <NA> 1
## 13 Other Other Other 1
## 14 South Asia South Asia South Asia 710
## 15 South Asia <NA> <NA> 63
## 16 Southeast Asia Southeast Asia Southeast Asia 49
## 17 Southeast Asia <NA> <NA> 34
## 18 Subsaharan Africa Subsaharan Africa Subsaharan Africa 514
## 19 Subsaharan Africa Subsaharan Africa <NA> 9
## 20 Subsaharan Africa <NA> <NA> 1
## 21 <NA> Europe Europe 2495
## 22 <NA> Europe <NA> 11
## 23 <NA> Other Other 61
## 24 <NA> Subsaharan Africa Subsaharan Africa 77
## 25 <NA> <NA> Americas 2
## 26 <NA> <NA> Asia 6
## 27 <NA> <NA> Europe 22
## 28 <NA> <NA> Multiple 1
## 29 <NA> <NA> Oceania 11
## 30 <NA> <NA> Subsaharan Africa 1103
## 31 <NA> <NA> <NA> 13543
allSRAFinal %>% count(finalRace, strictRace, strictestRace)
## finalRace strictRace strictestRace n
## 1 American Indian or\nAlaskan Native American Indian and\nAlaska Native American Indian and\nAlaska Native 65
## 2 American Indian or\nAlaskan Native American Indian and\nAlaska Native <NA> 2
## 3 American Indian or\nAlaskan Native <NA> <NA> 16
## 4 Asian Asian Asian 658
## 5 Black or\nAfrican American Black or\nAfrican American Black or\nAfrican American 1813
## 6 Black or\nAfrican American Black or\nAfrican American <NA> 1103
## 7 Black or\nAfrican American <NA> <NA> 77
## 8 Multiple Multiple Multiple 237
## 9 Multiple Multiple <NA> 7
## 10 Native Hawaiian or\nother Pacific Islander Native Hawaiian and\nother Pacific Islander Native Hawaiian and\nother Pacific Islander 8
## 11 Native Hawaiian or\nother Pacific Islander Native Hawaiian and\nother Pacific Islander <NA> 11
## 12 Other Other Other 217
## 13 Other <NA> <NA> 61
## 14 White White White 10021
## 15 White White <NA> 22
## 16 White <NA> <NA> 2495
## 17 <NA> <NA> Asian 309
## 18 <NA> <NA> Black or\nAfrican American 10
## 19 <NA> <NA> White 330
## 20 <NA> <NA> <NA> 6566
# Repeat the same plots as above:
sampleGeography <- allSRAFinal %>% count(SRA.Study, finalGeography, strictGeography, strictestGeography)
meltSampleGeography <- melt(sampleGeography)
## Using SRA.Study, finalGeography, strictGeography, strictestGeography as id variables
# First, by the original classification: finalGeography
meltSampleGeography %>% drop_na(finalGeography) %>%
ggplot(., aes(x = finalGeography, y=value, fill=finalGeography)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("finalGeography across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
meltSampleGeography %>% drop_na(strictGeography) %>%
ggplot(., aes(x = strictGeography, y=value, fill=strictGeography)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("strictGeography across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
meltSampleGeography %>% drop_na(strictestGeography) %>%
ggplot(., aes(x = strictestGeography, y=value, fill=strictestGeography)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("strictestGeography across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
# And now the same for race:
sampleRace <- allSRAFinal %>% count(SRA.Study, finalRace, strictRace, strictestRace)
meltSampleRace <- melt(sampleRace)
## Using SRA.Study, finalRace, strictRace, strictestRace as id variables
# First, by the original classification: finalRace
meltSampleRace %>% drop_na(finalRace) %>%
ggplot(., aes(x = finalRace, y=value, fill=finalRace)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("finalRace across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
meltSampleRace %>% drop_na(strictRace) %>%
ggplot(., aes(x = strictRace, y=value, fill=strictRace)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("strictRace across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
meltSampleRace %>% drop_na(strictestRace) %>%
ggplot(., aes(x = strictestRace, y=value, fill=strictestRace)) +
geom_bar(stat="identity", alpha=0.6) +
theme_bw() +
ggtitle("strictestRace across all studies") +
# xlab("System") +
ylab("Count") +
# scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "none")
allSRAFinal %>% group_by(finalCountry, finalGeography) %>% summarise(n = sum(!is.na(finalGeography))) %>% drop_na(finalGeography) %>%
ggplot(aes(x = finalCountry, y = n, fill=finalGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("finalGeography across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictGeography) %>% summarise(n = sum(!is.na(strictGeography))) %>% drop_na(strictGeography) %>%
ggplot(aes(x = finalCountry, y = n, fill=strictGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictGeography across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictestGeography) %>% summarise(n = sum(!is.na(strictestGeography))) %>% drop_na(strictestGeography) %>%
ggplot(aes(x = finalCountry, y = n, fill=strictestGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictestGeography across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, finalRace) %>% summarise(n = sum(!is.na(finalRace))) %>% drop_na(finalRace) %>%
ggplot(aes(x = finalCountry, y = n, fill=finalRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("finalRace across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictRace) %>% summarise(n = sum(!is.na(strictRace))) %>% drop_na(strictRace) %>%
ggplot(aes(x = finalCountry, y = n, fill=strictRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictRace across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictestRace) %>% summarise(n = sum(!is.na(strictestRace))) %>% drop_na(strictestRace) %>%
ggplot(aes(x = finalCountry, y = n, fill=strictestRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictestRace across all studies") +
xlab("SRA Depositor location") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
Now we flip the axes and the groupings:
allSRAFinal %>% group_by(finalCountry, finalGeography) %>% summarise(n = sum(!is.na(finalGeography))) %>% drop_na(finalGeography) %>%
ggplot(aes(fill = finalCountry, y = n, x=finalGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("finalGeography across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictGeography) %>% summarise(n = sum(!is.na(strictGeography))) %>% drop_na(strictGeography) %>%
ggplot(aes(fill = finalCountry, y = n, x=strictGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictGeography across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictestGeography) %>% summarise(n = sum(!is.na(strictestGeography))) %>% drop_na(strictestGeography) %>%
ggplot(aes(fill = finalCountry, y = n, x=strictestGeography)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictestGeography across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50", option="plasma") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, finalRace) %>% summarise(n = sum(!is.na(finalRace))) %>% drop_na(finalRace) %>%
ggplot(aes(fill = finalCountry, y = n, x=finalRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("finalRace across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictRace) %>% summarise(n = sum(!is.na(strictRace))) %>% drop_na(strictRace) %>%
ggplot(aes(fill = finalCountry, y = n, x=strictRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictRace across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(finalCountry, strictestRace) %>% summarise(n = sum(!is.na(strictestRace))) %>% drop_na(strictestRace) %>%
ggplot(aes(fill = finalCountry, y = n, x=strictestRace)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single"), stat="identity", alpha=0.8) +
theme_bw() +
ggtitle("strictestRace across all studies") +
xlab("Population descriptor") +
ylab("Samples (log10)") +
scale_y_continuous(trans='log10') +
# scale_fill_viridis(discrete=T, na.value="grey50") +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
theme(legend.position = "bottom")
## `summarise()` has grouped output by 'finalCountry'. You can override using the `.groups` argument.
allSRAFinal %>% group_by(SRA.Study, finalGeography, finalRace) %>%
summarise(n = n()) %>%
group_by(SRA.Study) %>%
summarise(nGeo = sum(n[!is.na(finalGeography)]), nRace = sum(n[!is.na(finalRace)])) %>%
filter(if_all(contains("n"), ~ .x > 0)) %>%
as.data.frame() %>%
melt() %>%
ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
geom_bar(stat="identity") +
theme_bw() +
ggtitle("Studies with both finalRace and finalGeography assignments") +
ylab("Count") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="bottom")
## `summarise()` has grouped output by 'SRA.Study', 'finalGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables
allSRAFinal %>% group_by(SRA.Study, strictGeography, strictRace) %>%
summarise(n = n()) %>%
group_by(SRA.Study) %>%
summarise(nGeo = sum(n[!is.na(strictGeography)]), nRace = sum(n[!is.na(strictRace)])) %>%
filter(if_all(contains("n"), ~ .x > 0)) %>%
as.data.frame() %>%
melt() %>%
ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
geom_bar(stat="identity") +
theme_bw() +
ggtitle("Studies with both strictRace and strictGeography assignments") +
ylab("Count") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="bottom")
## `summarise()` has grouped output by 'SRA.Study', 'strictGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables
# Can't be plotted because there are none.
try(allSRAFinal %>% group_by(SRA.Study, strictestGeography, strictestRace) %>%
summarise(n = n()) %>%
group_by(SRA.Study) %>%
summarise(nGeo = sum(n[!is.na(strictestGeography)]), nRace = sum(n[!is.na(strictestRace)])) %>%
filter(if_all(contains("n"), ~ .x > 0)) %>%
as.data.frame() %>%
melt() %>%
ggplot(., aes(x = SRA.Study, y = value, fill=variable)) +
geom_bar(stat="identity") +
theme_bw() +
ggtitle("Studies with both strictestRace and strictestGeography assignments") +
ylab("Count") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
theme(legend.position="bottom"))
## `summarise()` has grouped output by 'SRA.Study', 'strictestGeography'. You can override using the `.groups` argument.
## Using SRA.Study as id variables
## Error in data.frame(ids, x, data[, x]) :
## arguments imply differing number of rows: 0, 1
Stopping here for now - I am feeling like strictest is my favourite, actually. I had previously looked at how to resolve some of the conflicted studies (at the final level, there’s some new ones in the strict level), but hadn’t implemented it…
But now we gotta deal with one final hurdle: hispanic:
# How many studies have hispanic info and race or ethnicity info?
allSRAFinal %>% count(strictestRace, hispanic)
## strictestRace hispanic n
## 1 American Indian and\nAlaska Native hispanic 22
## 2 American Indian and\nAlaska Native non.hispanic 1
## 3 American Indian and\nAlaska Native <NA> 42
## 4 Asian hispanic 23
## 5 Asian non.hispanic 255
## 6 Asian <NA> 689
## 7 Black or\nAfrican American hispanic 81
## 8 Black or\nAfrican American non.hispanic 334
## 9 Black or\nAfrican American <NA> 1408
## 10 Multiple hispanic 34
## 11 Multiple non.hispanic 44
## 12 Multiple <NA> 159
## 13 Native Hawaiian and\nother Pacific Islander hispanic 2
## 14 Native Hawaiian and\nother Pacific Islander non.hispanic 4
## 15 Native Hawaiian and\nother Pacific Islander <NA> 2
## 16 Other hispanic 55
## 17 Other non.hispanic 132
## 18 Other <NA> 30
## 19 White hispanic 651
## 20 White non.hispanic 3672
## 21 White <NA> 6028
## 22 <NA> hispanic 387
## 23 <NA> non.hispanic 166
## 24 <NA> <NA> 9807
allSRAFinal %>% count(strictestGeography, hispanic)
## strictestGeography hispanic n
## 1 Americas <NA> 166
## 2 Asia <NA> 310
## 3 East Asia <NA> 1121
## 4 Europe non.hispanic 34
## 5 Europe <NA> 5435
## 6 Multiple <NA> 207
## 7 North Africa and\nWestern Asia <NA> 18
## 8 Oceania <NA> 11
## 9 Other <NA> 62
## 10 South Asia <NA> 718
## 11 Southeast Asia <NA> 49
## 12 Subsaharan Africa <NA> 1694
## 13 <NA> hispanic 1255
## 14 <NA> non.hispanic 4574
## 15 <NA> <NA> 8374
allSRAFinal %>% count(hispanic, strictestRace)
## hispanic strictestRace n
## 1 hispanic American Indian and\nAlaska Native 22
## 2 hispanic Asian 23
## 3 hispanic Black or\nAfrican American 81
## 4 hispanic Multiple 34
## 5 hispanic Native Hawaiian and\nother Pacific Islander 2
## 6 hispanic Other 55
## 7 hispanic White 651
## 8 hispanic <NA> 387
## 9 non.hispanic American Indian and\nAlaska Native 1
## 10 non.hispanic Asian 255
## 11 non.hispanic Black or\nAfrican American 334
## 12 non.hispanic Multiple 44
## 13 non.hispanic Native Hawaiian and\nother Pacific Islander 4
## 14 non.hispanic Other 132
## 15 non.hispanic White 3672
## 16 non.hispanic <NA> 166
## 17 <NA> American Indian and\nAlaska Native 42
## 18 <NA> Asian 689
## 19 <NA> Black or\nAfrican American 1408
## 20 <NA> Multiple 159
## 21 <NA> Native Hawaiian and\nother Pacific Islander 2
## 22 <NA> Other 30
## 23 <NA> White 6028
## 24 <NA> <NA> 9807
# Not sure if this is due to how we coded it, but that's pleasant. Except, what to do with the people with two labels??
saveRDS(allSRAFinal, file="20240901_allSRAFinal.rds")